Comparing the Statistical Fate of Paralogous and Orthologous Sequences.
Abstract
For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to gain insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model, we can estimate the probability of finding two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent [Formula: see text]. Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution.
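To make the central quantity concrete, the following is a minimal sketch of how a length distribution of exact matches between two sequences could be tabulated. It is an illustration only, not the authors' pipeline: the function name `exact_match_lengths` and the `min_len` parameter are hypothetical, and the brute-force scan over all diagonals is quadratic in sequence length, so it only suits short toy inputs. Genome-scale comparisons would need a suffix-based or seed-based matcher instead.

```python
from collections import Counter


def exact_match_lengths(a: str, b: str, min_len: int = 1) -> Counter:
    """Count maximal exact matches between a and b, keyed by match length.

    A maximal exact match is a run of identical characters along one
    diagonal (a fixed offset between positions in a and b) that cannot
    be extended in either direction. Hypothetical helper for illustration.
    """
    counts = Counter()
    n, m = len(a), len(b)
    # Diagonal d pairs position i in a with position i + d in b.
    for d in range(-(n - 1), m):
        run = 0
        i_start = max(0, -d)      # keep i + d >= 0
        i_end = min(n, m - d)     # keep i + d < m
        for i in range(i_start, i_end):
            if a[i] == b[i + d]:
                run += 1
            else:
                # A mismatch closes the current run, if any.
                if run >= min_len:
                    counts[run] += 1
                run = 0
        # A run may reach the end of the diagonal.
        if run >= min_len:
            counts[run] += 1
    return counts


if __name__ == "__main__":
    hist = exact_match_lengths("AAXBB", "AAYBB", min_len=2)
    for length in sorted(hist):
        print(length, hist[length])
```

Plotting the resulting counts against match length on log-log axes is one way to look for the kind of power-law tail the abstract describes; the exponent itself is not reproduced here because it is elided in the source.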
Similar resources
The Impact of Paralogy on Phylogenomic Studies – A Case Study on Annelid Relationships
Phylogenomic studies based on hundreds of genes derived from expressed sequence tags libraries are increasingly used to reveal the phylogeny of taxa. A prerequisite for these studies is the assignment of genes into clusters of orthologous sequences. Sophisticated methods of orthology prediction are used in such analyses, but it is rarely assessed whether paralogous sequences have been erroneous...
RNAscClust: clustering RNA sequences using structure conservation and graph based motifs
Motivation Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs, do not take the compensator...
Sequence analysis RNAscClust: clustering RNA sequences using structure conservation and graph based motifs
Motivation: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs, do not take the compensato...
Vertebrate Paralogous Conserved Noncoding Sequences May Be Related to Gene Expressions in Brain
Vertebrate genomes include gene regulatory elements in protein-noncoding regions. A part of gene regulatory elements are expected to be conserved according to their functional importance, so that evolutionarily conserved noncoding sequences (CNSs) might be good candidates for those elements. In addition, paralogous CNSs, which are highly conserved among both orthologous loci and paralogous loci...
Human-chimpanzee alignment: Ortholog exponentials and paralog power laws
Genomic subsequences conserved between closely related species such as human and chimpanzee exhibit an exponential length distribution, in contrast to the algebraic length distribution observed for sequences shared between distantly related genomes. We find that the former exponential can be further decomposed into an exponential component primarily composed of orthologous sequences, and a trun...
Journal:
- Genetics
Volume 204, Issue 2
Pages: -
Publication date: 2016